Process Data from 2018 into a consistent format.

This notebook brings the 2018 data into alignment with the desired format with respect to field names, types, and grouping.

Rename

Naming rules:

Sanitize ID columns as needed

Rearrange columns

Sanitize Non-ID columns

Sanitization functions

The pattern to use is:

  1. Alter the dataframe
  2. Test the dataframe against expectations
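
A minimal sketch of this alter-then-test pattern, assuming pandas. The `Crop` column and its values are made up for illustration and do not come from the 2018 data:

```python
import pandas as pd

# Hypothetical data: the column name and values are illustrative only.
df = pd.DataFrame({"Crop": ["corn", "conr", "soy"]})

# 1. Alter the dataframe: fix a known misspelling.
df["Crop"] = df["Crop"].replace({"conr": "corn"})

# 2. Test the dataframe against expectations: only known values may remain.
expected_values = {"corn", "soy"}
unexpected = set(df["Crop"].unique()) - expected_values
assert not unexpected, f"Unexpected values in Crop: {unexpected}"
```

Keeping the test directly after the alteration means a failed expectation points at the step that caused it.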

The main tasks that need to be completed are:

  1. Identify values that can't be converted to the expected data type. The "find_unconvertable_" family of functions should be used.

    1. find_unconvertable_datetimes
  2. For simple renaming (e.g. misspellings) or splitting non-tidy data into two rows ("entry1-entry2" -> "entry1", "entry2"), use sanitize_col.

  3. Move values that are ambiguous but pertain to data imputation to "Imputation_Notes" using relocate_to_Imputation_Notes.

  4. If new columns need to be added (e.g. mgmt.Ingredient for parsed components of Product, such as elements), create them with safe_create_col.

  5. Any one-off changes should be accomplished manually.

  6. Confirm columns match the expected types with check_df_dtype_expectations, and report mismatches.

These steps should be completed for each dataframe in turn to minimize the cognitive load on the reader.
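
The first and last tasks above can be sketched as follows. The two functions are hypothetical stand-ins written for illustration. They are not the notebook's `find_unconvertable_datetimes` or `check_df_dtype_expectations`, whose signatures are not shown here. The sample values and the expected dtypes are also assumptions:

```python
import pandas as pd


def list_unconvertible_datetimes(series: pd.Series) -> list:
    """Return the distinct non-missing values that pd.to_datetime cannot parse."""
    converted = pd.to_datetime(series, errors="coerce")
    # A value failed to convert if it became NaT but was not missing to begin with.
    failed = converted.isna() & series.notna()
    return series[failed].unique().tolist()


def list_dtype_mismatches(df: pd.DataFrame, expected: dict) -> dict:
    """Map each mismatched column to a tuple of (expected dtype, actual dtype)."""
    return {
        col: (want, str(df[col].dtype))
        for col, want in expected.items()
        if str(df[col].dtype) != want
    }


# Made-up weather rows; one Datetime entry is deliberately unparseable.
weather = pd.DataFrame(
    {
        "Datetime": ["2018-05-01", "2018-05-02", "unknown"],
        "Rainfall_Unit_mm": [0.0, 1.2, 3.4],
    }
)

bad_datetimes = list_unconvertible_datetimes(weather["Datetime"])

# Assumed expectations, for illustration only.
mismatches = list_dtype_mismatches(
    weather, {"Datetime": "datetime64[ns]", "Rainfall_Unit_mm": "float64"}
)
```

Here `bad_datetimes` holds the entries that need manual attention before conversion, and `mismatches` reports `Datetime` because the column has not yet been converted.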

Sanitization: Column data type expectations

Note: to handle missing values, some columns that would otherwise be ints are stored as floats.
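
A short illustration of why this happens in pandas, using made-up values: NaN is a floating-point value, so a default integer column that contains a missing entry is stored as floats.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration.
complete = pd.Series([1, 2, 3])
with_gap = pd.Series([1, np.nan, 3])

# Without missing values the column keeps an integer dtype.
complete_is_int = pd.api.types.is_integer_dtype(complete)

# With a missing value pandas falls back to a float dtype.
with_gap_is_float = pd.api.types.is_float_dtype(with_gap)

print(complete.dtype, with_gap.dtype)
```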

Sanitization: Alter entries

Static values (within season)

Datetime containing columns

Simple Columns

Check Success

Weather

Datetime

Rainfall_Unit_mm

Data_Cleaned

Simple Columns

Check Success

Management

Date_Datetime

Amount_Per_Acre

Ingredient

This column is to be the cleaned-up version of the "Product" column.
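
One way this could look, as a sketch only. The `Product` values are made up, the cleaning rule (trim whitespace, lowercase) is an assumption, and `add_column_if_absent` is a hypothetical helper rather than the notebook's `safe_create_col`:

```python
import pandas as pd

# Made-up Product entries; the real values are not shown in this notebook text.
mgmt = pd.DataFrame({"Product": [" Urea", "urea ", "Potash"]})


def add_column_if_absent(df: pd.DataFrame, name: str, values: pd.Series) -> pd.DataFrame:
    """Return a copy of df with a new column, refusing to overwrite an existing one."""
    if name in df.columns:
        raise ValueError(f"Column {name!r} already exists")
    out = df.copy()
    out[name] = values
    return out


# Assumed cleaning rule: strip surrounding whitespace and lowercase.
cleaned = mgmt["Product"].str.strip().str.lower()
mgmt = add_column_if_absent(mgmt, "Ingredient", cleaned)
```

The original `Product` column is left untouched, so the cleaned values can always be compared against the raw entries.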

Simple Columns

Check Success

Publish